Major overhaul of mbstring (part 31) #10591

alexdowad · 2023-02-14T19:37:12Z

One more step towards eliminating the legacy encoding conversion code.

mb_decode_mimeheader is now around 3 times faster.

FYA @cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai

The new implementation is 2.5x-3x faster. If an invalid charset name was used, the old implementation would get 'stuck' trying to parse the charset name and would not interpret any other MIME encoded words up to the end of the input string. The new implementation fixes this bug. If an (invalid) encoded word ends abruptly and a new (valid) encoded word starts, the old implementation would not decode the valid encoded word. The new implementation also fixes this. Otherwise, the behavior of the new implementation has been designed to closely match that of the old implementation.

mvorisek · 2023-02-14T20:41:16Z

ext/mbstring/mbstring.c

@@ -5641,3 +5613,232 @@ static void php_mb_gpc_set_input_encoding(const zend_encoding *encoding) /* {{{
 	MBSTRG(http_input_identify) = (const mbfl_encoding*)encoding;
 }
 /* }}} */
+
+static int8_t decode_base64(unsigned char c)


can standard base64 decode function be used here? especially with #6361

Thanks for the suggestion. I'll have to check what its interface looks like and see if it fits what is needed here.

Hmm, the functions provided by base64.c take a pointer/length pair (good) and return a zend_string (not what I wanted).

But maybe I can adjust the code to use a zend_string. Let me think about that a bit.

Or perhaps it's possible to make the other code internally work with a pointer/length pair. And let the existing callers of that code work with a wrapper for zend_string that extracts the pointer & length.

kamil-tekiela · 2023-02-14T21:50:09Z

It would be good to know what this function does. 😄 The PHP manual isn't very verbose https://www.php.net/manual/en/function.mb-decode-mimeheader.php
This comment actually seems to help a little bit https://www.php.net/manual/en/function.mb-decode-mimeheader.php#107692
I wonder if there is a functional specification for this. Did you adhere to some documentation?

alexdowad · 2023-02-15T06:31:08Z

It would be good to know what this function does. smile The PHP manual isn't very verbose https://www.php.net/manual/en/function.mb-decode-mimeheader.php This comment actually seems to help a little bit https://www.php.net/manual/en/function.mb-decode-mimeheader.php#107692 I wonder if there is a functional specification for this. Did you adhere to some documentation?

@kamil-tekiela Thanks for the comment. This might be helpful: https://en.wikipedia.org/wiki/MIME; scroll down to the subheading "Encoded-Word".

The MIME standards (defined in various IETF RFCs) allow e-mail header values to include non-ASCII characters, by encoding them in a special format. mb_decode_mimeheader decodes this special format to an 'ordinary' string.

The input string for mb_decode_mimeheader should be the value of one field-value pair from an e-mail message's headers. It does not accept all the headers at once; the user has to break the headers up into field names and values and pass the values one by one through mb_decode_mimeheader to decode them.

I briefly mentioned the relevant standard in the code comments:

/* Decode MIME encoded word as defined in RFC 2047 */

alecpl · 2023-02-15T09:48:03Z

FYI, there's also iconv_mime_encode/iconv_mime_decode. Also, I'd like to note Mail_Mime does not use any of these https://github.com/pear/Mail_Mime/blob/master/Mail/mimePart.php#L853

alexdowad · 2023-02-15T09:55:15Z

FYI, there's also iconv_mime_encode/iconv_mime_decode. Also, I'd like to note Mail_Mime does not use any of these https://github.com/pear/Mail_Mime/blob/master/Mail/mimePart.php#L853

Thanks for pointing that out!

I really wonder why we have two different built-in functions in PHP which do the same thing.

alecpl · 2023-02-15T10:59:12Z

I really wonder why we have two different built-in functions in PHP which do the same thing.

Yeah. I don't know. It might be that mbstring wasn't always been a "core extension". Also iconv supports a different (wider?) set of charsets.

youkidearitai · 2023-02-15T12:04:51Z

I found blog that about talk PHP 8.0 and PHP 8.1 or later of mb_encode_mimeheader and mb_decode_mimeheader result is different. That blog is https://blog.cretbird.co.jp/mb_encode_mimeheader-php81.html.

Imitate this behavior, I confirmed different to this PR and PHP 8.1 or later and PHP 8.0. Code is below.

<?php
var_dump(
    mb_decode_mimeheader(
        mb_encode_mimeheader(
            mb_convert_encoding(
                "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやいゆえよらりるれろわゐうゑをん",
                "ISO-2022-JP",
                "UTF-8"
            ),
            "ISO-2022-JP"
        )
    )
);

This PR result is string(152) "あいうえおかきくけこさしすせそたちつて?箸覆砲未佑里呂劼佞悗曚泙澆爐瓩發笋い罎┐茲蕕蠅襪譴蹐錣陲Δ颪鬚??", but PHP 8.1 or PHP 8.0 is 3v4l here. Why is different behavior?

youkidearitai · 2023-02-15T14:09:14Z

Result of mb_encode_mimeheader is not seems different. https://3v4l.org/BZEYU

zbenc · 2023-02-15T14:49:16Z

@MaxKellermann There is also the following known issue with this function: https://bugs.php.net/bug.php?id=68821

This issue makes it behave differently to imap_mime_header_decode(), which IMO should be equivalent.

So in order to properly handle MIME mail headers one must use this:

function correctDecodeMIME($string) {
	$string = \preg_replace_callback('/(=\?[^?]++\?[Qq]\?)([^?]++)(\?=|$)/D', static function ($match) {
		return $match[1] . \str_replace('_', '=20', $match[2]) . $match[3];
	}, $string);

	return \mb_decode_mimeheader($string);
}

…encoding

alexdowad · 2023-02-17T12:06:52Z

@MaxKellermann There is also the following known issue with this function: https://bugs.php.net/bug.php?id=68821

This issue makes it behave differently to imap_mime_header_decode(), which IMO should be equivalent.

Thanks for pointing that out. I reviewed the relevant RFC and you are 100% right; mb_decode_mimeheader does not handle underscores correctly.

I have added a fix to this PR.

alexdowad · 2023-02-17T12:36:46Z

This PR result is string(152) "あいうえおかきくけこさしすせそたちつて?箸覆砲未佑里呂劼佞悗曚泙澆爐瓩發笋い罎┐茲蕕蠅襪譴蹐錣陲Δ颪鬚??", but PHP 8.1 or PHP 8.0 is 3v4l here. Why is different behavior?

@youkidearitai As usual, thanks for the great review! You have uncovered an interesting issue here.

I haven't completed my investigation yet, but one first point: the test code which you kindly provided is buggy. The documentation for mb_encode_mimeheader says:

string
The string being encoded. Its encoding should be same as [mb_internal_encoding()](https://www.php.net/manual/en/function.mb-internal-encoding.php).

In the code sample, mb_internal_encoding is not ISO-2022-JP, but you are passing an ISO-2022-JP string to mb_encode_mimeheader.

Here is a sample where mb_internal_encoding is set correctly: https://3v4l.org/pCV0d

alexdowad · 2023-02-17T13:02:44Z

@youkidearitai OK, more findings...

In PHP 8.1, I adjusted the mbstring conversion code for some legacy text encodings to flag errors rather than ignoring them. For example, in ISO-2022-JP, various encoding modes can be selected using escape sequences like \x1B$B. Now, consider this ISO-2022-JP string:

"abc\x1B"

How should we treat this string? The useless \x1B byte should not be there; \x1B is expected to start an escape sequence in ISO-2022-JP, but this one does not.

When PHP 8.0 mbstring processes this string, it ignores the trailing ESC byte and treats the string as valid. (mb_get_info('illegal_chars') is not incremented.) PHP 8.1 mbstring treats the trailing ESC byte as an error, outputs an error marker, and increments mb_get_info('illegal_chars').

This is the reason why PHP 8.1 mb_decode_mimeheader produces an extra ? character in the test case which you found; it's because there is an invalid escape sequence in the string returned by mb_encode_mimeheader.

Now, why do we have an invalid escape sequence in the string returned by mb_encode_mimeheader?? Again, it's because you set mb_internal_encoding incorrectly. Because mb_internal_encoding is set to UTF-8, mb_encode_mimeheader treats your ISO-2022-JP string as UTF-8 (and actually, it is valid UTF-8).

In ISO-2022-JP, "\x1B$B" is a single escape sequence, which cannot be divided without changing the meaning. But in UTF-8, those are just 3 ordinary single-byte characters. Because mb_encode_mimeheader is treating the input string as UTF-8, it has no problem in splitting up those 3 bytes, putting the first one in the first MIME encoded word and the remaining ones in the second MIME encoded word.

So basically:

Because mb_internal_encoding was set wrongly, mb_encode_mimeheader produced wrong output. This wrong output was the same for both PHP 8.0 and PHP 8.1.
When mb_decode_mimeheader sees the invalid input, it ignores the problem in PHP 8.0, but in PHP 8.1 it alerts you to the problem by producing an error marker.

alexdowad · 2023-02-17T13:14:18Z

I found blog that about talk PHP 8.0 and PHP 8.1 or later of mb_encode_mimeheader and mb_decode_mimeheader result is different. That blog is https://blog.cretbird.co.jp/mb_encode_mimeheader-php81.html.

Solution #1 mentioned by Satou-san in the above-referenced blog post is the correct one:

１．mb_internal_encodingをISO-2022-JPに指定し、変換処理後に元に戻す

He says in the post that when he tried this, he found some strings which were corrupted when passed through mb_encode_mimeheader and mb_decode_mimeheader. But unfortunately, he doesn't provide any test cases, so we can't diagnose what actually went wrong.

alexdowad · 2023-02-17T13:54:49Z

Solution 1 mentioned by Satou-san in the above-referenced blog post is the correct one:

There's actually another solution which it seems Satou-san didn't think about: provide a UTF-8 header, and let mb_encode_mimeheader convert it to ISO-2022-JP. Like this:

mb_internal_encoding('UTF-8');
$str = "This is a UTF-8 string. あいうえお。您好。";
mb_encode_mimeheader($str, 'ISO-2022-JP');

youkidearitai · 2023-02-18T07:22:21Z

But unfortunately, he doesn't provide any test cases, so we can't diagnose what actually went wrong.

Indeed, Satou-san only presentation some one, we can't know what to do. We tried many workaround because multibyte mail is often mojibake.

alexdowad · 2023-02-18T07:35:18Z

Indeed, Satou-san only presentation some one, we can't know what to do. We tried many workaround because multibyte mail is often mojibake.

Sorry to hear that. I am ready to fix anything which is wrong, but first, we need test cases which demonstrate the problem.

You have provided one very nice test case, which demonstrates very well what can happen when mb_internal_encoding is wrongly set. Now we need more test cases which show how mojibake can occur even when mb_internal_encoding is set correctly.

If you, or anyone else, can come forward with such a test case, I am ready to analyze the root cause. Understanding the root cause is very necessary before anything can be done about it.

alexdowad · 2023-02-22T16:45:43Z

Hmm. I was hoping that maybe @youkidearitai might come up with some other test cases related to the reports of mojibake, but it seems that none are forthcoming now...

I am inclined to merge this PR "as is" and leave the issue of using the SIMD-accelerate Base64 decoder as a possible future enhancement.

If anyone strongly feels we need SIMD-accelerated mb_decode_mimeheader now and not later, please speak up....

Girgias

Don't know about mime header encoding, but looks reasonable.

alexdowad · 2023-02-22T21:21:18Z

Landed... thanks all.

youkidearitai · 2023-02-22T22:24:25Z

Hmm. I was hoping that maybe @youkidearitai might come up with some other test cases related to the reports of mojibake, but it seems that none are forthcoming now...

I'm sorry for waiting me. But mail mojibake know-how is almost closed part of company. The problems reported in php-src only a fraction...

alexdowad added 3 commits February 8, 2023 21:26

Remove unneeded function mbfl_name2no_encoding

76c17df

Remove unneeded function mbfl_no2preferred_mime_name

c3a22fb

github-actions bot added the Extension: mbstring label Feb 14, 2023

mvorisek reviewed Feb 14, 2023

View reviewed changes

mb_decode_mimeheader obeys RFC 2047 regarding underscores and QPrint …

ef85122

…encoding

Girgias approved these changes Feb 22, 2023

View reviewed changes

alexdowad closed this Feb 22, 2023

alexdowad deleted the mime branch February 22, 2023 21:21

Major overhaul of mbstring (part 31) #10591

Major overhaul of mbstring (part 31) #10591

Uh oh!

Conversation

alexdowad commented Feb 14, 2023

Uh oh!

mvorisek Feb 14, 2023

Choose a reason for hiding this comment

Uh oh!

alexdowad Feb 14, 2023

Choose a reason for hiding this comment

Uh oh!

alexdowad Feb 15, 2023

Choose a reason for hiding this comment

Uh oh!

nielsdos Feb 15, 2023

Choose a reason for hiding this comment

Uh oh!

kamil-tekiela commented Feb 14, 2023

Uh oh!

alexdowad commented Feb 15, 2023

Uh oh!

alecpl commented Feb 15, 2023

Uh oh!

alexdowad commented Feb 15, 2023

Uh oh!

alecpl commented Feb 15, 2023

Uh oh!

youkidearitai commented Feb 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

youkidearitai commented Feb 15, 2023

Uh oh!

zbenc commented Feb 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexdowad commented Feb 17, 2023

Uh oh!

alexdowad commented Feb 17, 2023

Uh oh!

alexdowad commented Feb 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexdowad commented Feb 17, 2023

Uh oh!

alexdowad commented Feb 17, 2023

Uh oh!

youkidearitai commented Feb 18, 2023

Uh oh!

alexdowad commented Feb 18, 2023

Uh oh!

alexdowad commented Feb 22, 2023

Uh oh!

Girgias left a comment

Choose a reason for hiding this comment

Uh oh!

alexdowad commented Feb 22, 2023

Uh oh!

youkidearitai commented Feb 22, 2023

Uh oh!

Uh oh!

youkidearitai commented Feb 15, 2023 •

edited

Loading

zbenc commented Feb 15, 2023 •

edited

Loading

alexdowad commented Feb 17, 2023 •

edited

Loading